GrabURL 2.0 - Selective URL fetching utility
Copyright (C) 1996-97 Serge Emond
----------------------------------------------------------------------------
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
----------------------------------------------------------------------------
Table of contents
~~~~~~~~~~~~~~~~~
1. This document
2. Config file
3. Arguments
4. Configuration example
5. Url Completion
6. Filenames
7. One or 2 examples
8. How to contact the author
1. This document
~~~~~~~~~~~~~~~~
Many things may be incorrect or missing in this document. If you really
want accurate information, read the source code.
2. Config file
~~~~~~~~~~~~~~
Lines with '#' as the first non-whitespace character are comments.
Arguments to commands can be enclosed in " or ' so that they may contain spaces.
2.1. Global commands
2.1.1 Section
Use: Section <SectionName>
You need a matching End command for each section. A section is a part of
the config file that is processed by a different part of the program. Right
now there are 2 known sections: "http" and "scan". Each section has its own
set of commands and is processed independently. There are also "global"
options that sit outside of any Section.
2.1.2 End
Use: End
You need one to mark the end of a section and return to the "global" part.
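As a sketch, the overall layout of a config file looks like this (Delay and
the "http" section commands are described later in this document; the
indentation is only for readability):
  Delay 10
  Section "http"
    # http-specific commands go here
  End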
2.1.3 Include
Use: Include <filename>
Includes another config file, which is processed exactly the same way as
the current one.
Example: if several of your config files share a common part, put that
common part in its own file and Include it at the beginning of each
specialized config file.
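For instance, a specialized config might start like this (the filename
"common.rc" is only illustrative):
  Include "common.rc"
  Delay 20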
2.1.4 Delay
Use: Delay <time>
This makes GrabURL pause between each file download, to give your link and
the Internet in general a break. The argument is in tenths of a second.
It is highly recommended to use some delay (e.g. 10 or 20) for downloads
that might recurse far, especially if you have a fast Internet
connection. Harassing a host is not nice for them and can get you
site-banned. Be nice and share the net!
Default value: 0
2.1.5 EMail
Use: EMail <your email>
This is the email address sent to the remote server with each request.
If you don't specify any EMail keyword, NO email information is sent.
You should send this information so that site operators can email you if
you do something wrong, instead of banning you from their site outright.
It could also prevent a blanket GrabURL ban (refusing all connections made
with graburl).
2.1.6 SaveRoot
Use: SaveRoot <root directory>
This tells GrabURL where to save the files. You can use "." (or . or '.')
to save in the current directory. The default value is the current directory.
2.1.7 DirMode
Use: Dirmode <mode>
This is the mode (protection bits) for the directories GrabURL creates.
See the mkdir(2) man page for more information about modes. The mode is an
octal number. The default value is 700 (octal), which means the protection
bits rwx------. The first digit is the user bits, the second the group bits
and the third the others bits. Each digit is the sum of: 1 for the execute
bit (x), 2 for the write bit (w) and 4 for the read bit (r). The mode of
existing directories is NOT modified.
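For instance, to let your group list and enter the created directories
while keeping them closed to everyone else, you could use this non-default
mode (shown only as a worked example):
  DirMode 750
750 means user rwx (4+2+1=7), group r-x (4+1=5) and others nothing (0).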
2.1.8 FileMode
Use: Filemode <mode>
This affects files the same way DirMode affects directories. The argument
also has the same meaning. If a file gets overwritten, its mode WILL be
modified. Default is 700 (octal).
2.1.9 Translate
Use: Translate <from> <to>
This modifies file and directory names. The first character of "from"
is changed to the first character of "to", the 2nd character of "from"
becomes the 2nd character of "to", and so on.
"from" and "to" MUST have the same length.
Example: translate "~%()" "__++"
If you download http://moo.com/~hey/Puppet(7).html, instead of saving it
to "moo.com/~hey/Puppet(7).html" in the SaveRoot-directory, it will save
it to "moo.com/_hey/Puppet+7+.html".
2.1.10 Retry
Use: Retry
When present, GrabURL will retry ONCE every URL that failed to be
downloaded. I'm not even sure it is completely implemented. (!)
2.1.11 Scan
Use: Scan
When present, GrabURL will scan all HTML files and add the urls found in
them to the download-list. A file is considered to be HTML if:
1- the server says it is HTML, or
2- its filename contains ".htm" (ie "bobo.htmt.jpg" will be scanned)
2.1.12 StayInHost
Use: StayInHost
When this is in the config file, and a URL is flagged SCAN (ie with the
SCAN config command, the SCAN command-line option or in the workfile),
GrabURL will only add the urls found while scanning that file if their
host name is the SAME as the host name of the scanned file.
Example: you scan the result of http://hop.com/. http://hop.com/2.gif
WILL be added to the list, but http://www.yahoo.com/ won't.
2.1.13 NonExists
Use: NonExists
No files get overwritten on disk. If the file already exists, the
url is skipped.
2.1.14 Depth
Use: Depth <depth>
Each file has a "depth", which is an integer. If a file is HTML and is
scanned, each file added to the list from it gets that file's depth minus
one (ie x.html has depth 4 and is scanned, so all new files will have a
depth of 3).
A file with depth 0 will NOT be scanned.
A file with depth -1 gives a depth of -1 to its children and WILL be
scanned.
The default is -1 (scan all of the Internet).
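As a worked example (the filenames are hypothetical, and assuming the
starting URL gets the configured depth), with "Depth 2" in the config file:
  start.html    depth 2, scanned
  a.html        depth 1, scanned (added from start.html)
  b.html        depth 0, downloaded but NOT scanned
Recursion therefore stops two levels below the starting URL.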
2.1.15 AddLog
Use: AddLog <filename> [<optional flags>]
This adds a log to which text information will be appended while GrabURL
runs.
Filename is simply the file to append the information to. There are 2
special filenames: stdout and stderr, which send output to the standard
output and standard error streams, respectively.
Options are:
  date   add a date stamp
  time   add a time stamp
  type   add the type of output
  #      all characters from # to EOL are ignored (ie a comment)
For example,
AddLog "stderr" type
AddLog "/var/log/GrabURL.log" type date time
will do 2 things: send output to /var/log/GrabURL.log like this:
> 01/00/98 00:06:49 [1/1] http://localhost/
> 01/00/98 00:06:49 Received 292 bytes in 0 sec
and print this on the screen, via the error stream:
> [1/1] http://localhost/
> Received 292 bytes in 0 sec
Types:
  ?   information
  *   error
  +   scan stuff (additions to the list)
  >   action, like "Receiving file"
  X   debug information (there is not much debugging right now, and GrabURL
      has to be compiled with GU_DEBUG)
2.2 Section HTTP
The following options have to be enclosed between
Section "http"
and
End
2.2.1 Auth
Use: Auth <realm> <userpw>
This is for password-protected HTTP files. The only authentication scheme
implemented is the only official one I know of: "basic" (see rfc1945).
"realm" is a code associated with the protected page(s) that lets clients
(graburl, netscape, ...) know which password to use for that page. If you
don't know the realm of a password-protected page, simply try to get it;
GrabURL will tell you what the realm is.
"userpw" has 2 parts: your user name and your password, separated by a
colon. That part is "encrypted" using base64 and sent directly to the
server. (So anyone who intercepts your request can learn your
password ;)
Example config-file:
auth flyers 'joe:foo'
Auth "MoNgOlfier SeX life" james:tkirk
If GrabURL tries to get a protected page and the server says the realm
is "flyers", the user name "joe" with password "foo" is used.
If the realm is "MoNgOlfier SeX life", user/password used are
james/tkirk.
Try to use the right case for the letters ;)
Can be used more than once.
2.2.2 RequestLine
Use: RequestLine <line>
This adds the line <line>, exactly as you typed it, to ALL HTTP requests.
A CR/LF is appended.
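For example, the line below (it also appears in the configuration example
of section 4) adds an Accept header to every request:
  RequestLine "Accept: */*"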
2.3 Section SCAN
The following commands have to be enclosed between
Section "scan"
and
End
2.3.1 Key
Use: Key <key>
What I call a "key" (keyword) is a word in the HTML language that indicates
that a URL follows. In an HTML file it will look like this, for a key
named "src":
<key key key src ../hop.jpg key key>
<key key src=../hop.jpg key>
<key key src="../hop.jpg" key>
etc...
This will add the relative url ../hop.jpg to the list.
I know of 3 keys for now:
SRC which is used for images and frames
BACKGROUND which is used for... the background image
HREF which is used to link to other images, HTML pages, email addresses or
whatever.
I strongly suggest using this in your config file:
key SRC
key BACKGROUND
key HREF
2.3.2 Ignore
Use: Ignore <begin> [<end>]
Works a bit like "key". When the scanner encounters a tag containing
"begin", it will skip EVERYTHING up to the moment it encounters the
matching "end" tag. If "end" is not specified, the skipping stops when
the "begin" tag is encountered again.
2.3.3 Diese
Use: Diese <search|normal>
A diese "#" (number sign) means, in the "url language", to search for a
label in an html file after displaying it. For example,
http://hey.com/hop.html and
http://hey.com/hop.html#contents
both point to the SAME file, "http://hey.com/hop.html". In the second
one, however, your browser (ie netscape) will search for a label named
"contents" and position the page accordingly.
When "diese search" is used, the part beginning with "#" is stripped from
the name and the "#" acts NORMALLY, ie it means to search in the page.
"diese normal" makes the "#" a normal character, and the request for the
web page will be sent WITH the "#".
You should always use "diese search", which is the default, unless you
have a really specific and private use for the alternative.
2.3.4 Skip
Use: Skip <regular expression>
This is a regular expression that, if matched, causes the to-be-added
URL to be rejected and NOT added. You can use as many Skip regular
expressions as you wish, and if a single one of them matches, the url
is rejected.
See "Pattern" below for an example.
2.3.5 Pattern
Use: Pattern <regular expression>
This is a regular expression that MUST be matched for GrabURL to add a
URL to the list. You can use as many as you want, and ALL of them will
have to match for an url to be appended to the list.
Example:
Pattern .jpe*g$
Skip foo
When an HTML file is scanned, the urls found in it will only be appended
to the download-list if:
1) "foo" is NOT in the url name
2) the url ends with .jpg, .jpeg, .jpeeg, .jpeeeg, ....
ie,
http://www.pod.ca/andrew.html rejected
http://www.foo.org/hey.jpeg rejected: contains "foo"
http://www.bar.org/hey.jPeEg accepted
3. Arguments
~~~~~~~~~~~~
Arguments are the information you pass to graburl via the command line.
ALL arguments not matching one of the following are considered URLs to
be added to the download-list.
3.1 Help
Arg: help
Syn: ?, -h, --help
Gives a summary of the commands supported by GrabURL.
3.2 License
Arg: license
Shows the copyright notice and information about the warranty, etc. (GPL)
3.3 Config
Arg: Config <config_file>
Syn: c <config_file>
Use this config file. When that option is not specified, the search order
is ".graburlrc" in your home directory, then "/etc/graburlrc".
3.4 SaveRoot
Arg: SaveRoot <root_dir>
Syn: sr <root_dir>
Specifies the root directory. See SaveRoot in the config file for more
information.
3.5 InFile
Arg: Infile <file>
Syn: if <file>
Points to a file containing one URL per line. You can specify multiple
'infiles'.
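An infile is just a plain-text list, for example (the URLs are
hypothetical):
  http://hop.com/
  http://hop.com/vip/
  www.umontreal.ca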
3.6 WorkFile
Arg: Workfile <file>
Syn: wf <file>
Points to a work file in which each url is "saved" with its options and
status. When GrabURL READS the file, it can simply be one URL on each
line, like for the InFile option. Basically, each line looks like this
(a sample workfile is sketched after the option list below):
<type> <url> [<options>]
Type can be:
  Q       Queued - not processed/downloaded yet
  R       Received - it's on your disk now :)
  F       Failed - will be retried if you specified the RETRY option
  X       Fatal error - the document doesn't exist, ...
  M       Moved - the server generally gives a new URL and GrabURL will
          automatically add it to the list
  (else)  Skip that url, do nothing with it
An example of a moved url: if you try to get http://hop.com/vip and vip is
a directory, the server generally responds with a "moved" error and gives
GrabURL the new location "http://hop.com/vip/".
The URL is the url itself, enclosed or not in " or '.
Options are:
  TO=filename       filename to save the file to
  D=depth           depth of this file, if not -1
  RETRY             retry flag for this url
  NODIRS or ND      do not create sub-directories for this url
  NONEXISTS or NE   do not download if a file of that name is on disk
  SCAN              scan this file, if HTML, for recursion
  STAYONHOST        same as the StayInHost config option
  IFMODIFIED        send an IfModified request to the server
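As a sketch (the urls are hypothetical, and the exact spacing GrabURL
writes may differ), workfile lines following the <type> <url> [<options>]
layout look like this:
  Q http://hop.com/pics/ D=2 SCAN STAYONHOST
  R http://hop.com/index.html
  F http://hop.com/missing.gif RETRY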
3.7 SaveTo
Arg: SaveTo <file>
Syn: st <file>
Saves the url to the given file. You may not specify a workfile or more
than one url when you use this option on the command line.
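For example, to fetch a single file into a name of your choice (the url
is illustrative):
  graburl http://hop.com/logo.gif st logo.gif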
3.8 Retry
Arg: retry or noretry
Sets/unsets the RETRY flag.
3.9 Ifmodified
Arg: IfModified or NOIfModified
Syn: im or noim
Sets/unsets the IfModified flag. IfModified is a request sent to the
server giving the date of the file on disk, if we already have it.
If the server supports that option and our file is newer than theirs (ie
the file hasn't been modified), the server tells us it is not modified
and doesn't send the file.
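For example, to re-fetch a page only if the server's copy has changed
since the one already on disk (same host as in the examples of section 7):
  graburl www.umontreal.ca im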
3.10 NonExists
Arg: NonExists or NONonExists
Syn: ne or none
Sets/unsets the NO_EXISTS flag. If set, with "NonExists" on the command
line or in the config file, files don't get overwritten and the url is
skipped.
3.11 NoDirs
Arg: NoDirs or NONoDirs
Syn: nd or nond
Sets/unsets the nodir flag. See section 6 (Filenames) below for how files
are saved to disk.
3.12 SaveHeader
Arg: SaveHeader or NOSaveHeader
Syn: sh or nosh
See config file.
3.13 Scan
Arg: scan or Recursive
Syn: r
Scan files for recursive downloads. Use NOSCAN to turn it off if you
set it via the config file.
3.13.1 StayInHost
Arg: StayInHost
Syn: sih
The scanner only adds an url if it has the same hostname as its parent.
3.13.2 Pattern
Arg: Pattern <regexp>
Syn: p <regexp>
See config file. You may have multiple patterns.
3.13.3 SkipPattern
Arg: SkipPattern <regexp>
Syn: sp <regexp>
See config file. You may use multiple skip patterns.
3.13.4 Depth
Arg: Depth <depth>
Syn: d <depth>
See config file.
4. Configuration example
~~~~~~~~~~~~~~~~~~~~~~~~
---------- Begins here
# This is a comment!!
Translate "~%()" "____"
SaveRoot .
FileMode 0700
DirMode 0700
Delay 5
AddLog "stderr" type
AddLog "GrabURL.log" type date time
Section "http"
# Realm yop, user grey, password moppe
Auth yop grey:moppe
RequestLine "Accept: */*"
End
Section "scan"
key SRC
key BACKGROUND
key HREF
ignore BLOCKQUOTE /BLOCKQUOTE
diese search
End
----------- EOF
5. Url Completion
~~~~~~~~~~~~~~~~~
For now, GrabURL only understands HTTP. I wanted to add FTP, but I don't
have the time or the need for it.
If you give GrabURL (on the command line, or in a workfile or infile) a
name alone, like "www.yahoo.com", graburl will complete it to
"http://www.yahoo.com/".
The scanner will also encounter relative urls in HTML files. Those urls
are not complete; they refer to a file relative to the current document.
For example, the url "../imgs/banner.jpg" in the file coming from
"http://hey.com/carlton/index.html" points in reality to the file
"http://hey.com/imgs/banner.jpg".
GrabURL understands names beginning with "/", containing "." and/or "..",
and will modify the url accordingly.
6. Filenames
~~~~~~~~~~~~
Before creating a file, GrabURL changes its working directory to the
"saveroot" directory. It then creates a directory with the name of the
host and cd's into it. Then it creates all the sub-directories named in
the url and puts the file there.
For example, if saveroot is "/tmp", the url http://buz.com/~hey/ho.html
will be saved to /tmp/buz.com/~hey/ho.html.
If you have a translation for "~" to "_", it will be saved as
/tmp/buz.com/_hey/ho.html.
If the option "nodirs" is specified, the filename will be /tmp/ho.html.
If an url ends with "/", the filename will be "index.html".
7. One or 2 examples
~~~~~~~~~~~~~~~~~~~~
graburl www.umontreal.ca
-> will download http://www.umontreal.ca/
graburl www.umontreal.ca r sih
-> Downloads the complete site of Université de Montréal.
graburl www.umontreal.ca r p www.umontreal.ca
-> same as the previous one
graburl www.umontreal.ca r p www.umontreal.ca sp .gif$
-> same as above, but doesn't download files ending with ".gif".
graburl www.umontreal.ca r sih wf hey.wf
- stop the transfer after 2-3 files, then type
graburl wf hey.wf
-> Starts to download recursively as in the 2nd example; the user aborts
the download, then resumes it.
8. How to contact the author
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
email: Serge Emond <greyl@videotron.ca>
Serge Emond <ei971807@uqac.uquebec.ca>
smail: Serge Emond
3392 des Anemones
Jonquiere, Quebec
Canada, G7S 5V4
You can also check the official GrabURL web page:
http://pages.infinit.net/greyl/graburl/
It doesn't contain much, but you can download graburl from there.
Please don't write things like this:
"It only downloads the first file and doesn't do recursion, why?"
And please have a look at the README.txt file that came with this.